House Prices: Advanced Regression Techniques

Kaggle Competition

Problem Summary

  • Problem Type: Regression
  • Target: SalePrice (continuous)
  • There are 81 attributes and 1,460 observations in the training set

Each record describes a residential home sale in Ames, Iowa.

I'll evaluate the algorithms' performance using the RMSE and R2 metrics.

RMSE gives an overall sense of how far the predictions fall from the actual values (0 is perfect), and R2 indicates how much of the variance in the target the model explains (1 is perfect; 0 means no better than predicting the mean).
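To make the metrics concrete, here is a minimal hand-rolled sketch of both; the vectors are made-up illustration values, not results from this dataset:

# toy actuals and predictions, purely for illustration
actual    <- c(208500, 181500, 223500, 140000)
predicted <- c(201000, 185000, 230000, 151000)

rmse <- sqrt(mean((predicted - actual)^2))                               # 0 is perfect
r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2) # 1 is perfect
rmse; r2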

Results Summary

My best-performing model was a tuned Cubist algorithm, which scored an RMSE of about 28,147 (RMSLE 0.117) on an unseen validation set. An earlier, less thoroughly tuned Cubist run scored about 36,267 (RMSLE 0.143).

Setup


In [111]:
# first install packages devtools and pacman manually

#pull in source functions from github
devtools::source_url('https://raw.githubusercontent.com/jsphyg/Machine_Learning_Notebooks/master/myRfunctions.R')
#source("C:\\Work\\myRfunctions.R")
fnRunDate()
fnInstallPackages()


SHA-1 hash of file is bfa47fa27ed24887838ddcd2c9aa1d862c080011
'Project last run on Fri Sep 22 2:08:49 PM 2017'
'Package install completed'

In [112]:
# import data, converting blanks and literal "NA" strings to missing values
dataset <- read_csv("C:\\Work\\kaggle_house_prices\\train.csv", na = c("","NA"))
test <- read_csv("C:\\Work\\kaggle_house_prices\\test.csv", na = c("","NA"))



head(dataset)
head(test)


Parsed with column specification:
cols(
  .default = col_character(),
  Id = col_integer(),
  MSSubClass = col_integer(),
  LotFrontage = col_integer(),
  LotArea = col_integer(),
  OverallQual = col_integer(),
  OverallCond = col_integer(),
  YearBuilt = col_integer(),
  YearRemodAdd = col_integer(),
  MasVnrArea = col_integer(),
  BsmtFinSF1 = col_integer(),
  BsmtFinSF2 = col_integer(),
  BsmtUnfSF = col_integer(),
  TotalBsmtSF = col_integer(),
  `1stFlrSF` = col_integer(),
  `2ndFlrSF` = col_integer(),
  LowQualFinSF = col_integer(),
  GrLivArea = col_integer(),
  BsmtFullBath = col_integer(),
  BsmtHalfBath = col_integer(),
  FullBath = col_integer()
  # ... with 18 more columns
)
See spec(...) for full column specifications.
Parsed with column specification:
cols(
  .default = col_character(),
  Id = col_integer(),
  MSSubClass = col_integer(),
  LotFrontage = col_integer(),
  LotArea = col_integer(),
  OverallQual = col_integer(),
  OverallCond = col_integer(),
  YearBuilt = col_integer(),
  YearRemodAdd = col_integer(),
  MasVnrArea = col_integer(),
  BsmtFinSF1 = col_integer(),
  BsmtFinSF2 = col_integer(),
  BsmtUnfSF = col_integer(),
  TotalBsmtSF = col_integer(),
  `1stFlrSF` = col_integer(),
  `2ndFlrSF` = col_integer(),
  LowQualFinSF = col_integer(),
  GrLivArea = col_integer(),
  BsmtFullBath = col_integer(),
  BsmtHalfBath = col_integer(),
  FullBath = col_integer()
  # ... with 17 more columns
)
See spec(...) for full column specifications.
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1 60 RL 65 8450 Pave NA Reg Lvl AllPub ... 0 NA NA NA 0 2 2008 WD Normal 208500
2 20 RL 80 9600 Pave NA Reg Lvl AllPub ... 0 NA NA NA 0 5 2007 WD Normal 181500
3 60 RL 68 11250 Pave NA IR1 Lvl AllPub ... 0 NA NA NA 0 9 2008 WD Normal 223500
4 70 RL 60 9550 Pave NA IR1 Lvl AllPub ... 0 NA NA NA 0 2 2006 WD Abnorml 140000
5 60 RL 84 14260 Pave NA IR1 Lvl AllPub ... 0 NA NA NA 0 12 2008 WD Normal 250000
6 50 RL 85 14115 Pave NA IR1 Lvl AllPub ... 0 NA MnPrv Shed 700 10 2009 WD Normal 143000

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
1461 20 RH 80 11622 Pave NA Reg Lvl AllPub ... 120 0 NA MnPrv NA 0 6 2010 WD Normal
1462 20 RL 81 14267 Pave NA IR1 Lvl AllPub ... 0 0 NA NA Gar2 12500 6 2010 WD Normal
1463 60 RL 74 13830 Pave NA IR1 Lvl AllPub ... 0 0 NA MnPrv NA 0 3 2010 WD Normal
1464 60 RL 78 9978 Pave NA IR1 Lvl AllPub ... 0 0 NA NA NA 0 6 2010 WD Normal
1465 120 RL 43 5005 Pave NA IR1 HLS AllPub ... 144 0 NA NA NA 0 1 2010 WD Normal
1466 60 RL 75 10000 Pave NA IR1 Lvl AllPub ... 0 0 NA NA NA 0 4 2010 WD Normal

In [113]:
Hmisc::describe(dataset, listunique=1)


dataset 

 81  Variables      1460  Observations
--------------------------------------------------------------------------------
Id 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0    1460       1   730.5   73.95  146.90  365.75  730.50 1095.25 
    .90     .95 
1314.10 1387.05 

lowest :    1    2    3    4    5, highest: 1456 1457 1458 1459 1460 
--------------------------------------------------------------------------------
MSSubClass 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0      15    0.94    56.9      20      20      20      50      70 
    .90     .95 
    120     160 

           20 30 40 45  50  60 70 75 80 85 90 120 160 180 190
Frequency 536 69  4 12 144 299 60 16 58 20 52  87  63  10  30
%          37  5  0  1  10  20  4  1  4  1  4   6   4   1   2
--------------------------------------------------------------------------------
MSZoning 
      n missing  unique 
   1460       0       5 

          C (all) FV RH   RL  RM
Frequency      10 65 16 1151 218
%               1  4  1   79  15
--------------------------------------------------------------------------------
LotFrontage 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1201     259     110       1   70.05      34      44      59      69      80 
    .90     .95 
     96     107 

lowest :  21  24  30  32  33, highest: 160 168 174 182 313 
--------------------------------------------------------------------------------
LotArea 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0    1073       1   10517    3312    5000    7554    9478   11602 
    .90     .95 
  14382   17401 

lowest :   1300   1477   1491   1526   1533
highest:  70761 115149 159000 164660 215245 
--------------------------------------------------------------------------------
Street 
      n missing  unique 
   1460       0       2 

Grvl (6, 0%), Pave (1454, 100%) 
--------------------------------------------------------------------------------
Alley 
      n missing  unique 
     91    1369       2 

Grvl (50, 55%), Pave (41, 45%) 
--------------------------------------------------------------------------------
LotShape 
      n missing  unique 
   1460       0       4 

IR1 (484, 33%), IR2 (41, 3%), IR3 (10, 1%), Reg (925, 63%) 
--------------------------------------------------------------------------------
LandContour 
      n missing  unique 
   1460       0       4 

Bnk (63, 4%), HLS (50, 3%), Low (36, 2%), Lvl (1311, 90%) 
--------------------------------------------------------------------------------
Utilities 
      n missing  unique 
   1460       0       2 

AllPub (1459, 100%), NoSeWa (1, 0%) 
--------------------------------------------------------------------------------
LotConfig 
      n missing  unique 
   1460       0       5 

          Corner CulDSac FR2 FR3 Inside
Frequency    263      94  47   4   1052
%             18       6   3   0     72
--------------------------------------------------------------------------------
LandSlope 
      n missing  unique 
   1460       0       3 

Gtl (1382, 95%), Mod (65, 4%), Sev (13, 1%) 
--------------------------------------------------------------------------------
Neighborhood 
      n missing  unique 
   1460       0      25 

lowest : Blmngtn Blueste BrDale  BrkSide ClearCr
highest: Somerst StoneBr SWISU   Timber  Veenker 
--------------------------------------------------------------------------------
Condition1 
      n missing  unique 
   1460       0       9 

          Artery Feedr Norm PosA PosN RRAe RRAn RRNe RRNn
Frequency     48    81 1260    8   19   11   26    2    5
%              3     6   86    1    1    1    2    0    0
--------------------------------------------------------------------------------
Condition2 
      n missing  unique 
   1460       0       8 

          Artery Feedr Norm PosA PosN RRAe RRAn RRNn
Frequency      2     6 1445    1    2    1    1    2
%              0     0   99    0    0    0    0    0
--------------------------------------------------------------------------------
BldgType 
      n missing  unique 
   1460       0       5 

          1Fam 2fmCon Duplex Twnhs TwnhsE
Frequency 1220     31     52    43    114
%           84      2      4     3      8
--------------------------------------------------------------------------------
HouseStyle 
      n missing  unique 
   1460       0       8 

          1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
Frequency    154     14    726      8     11    445     37   65
%             11      1     50      1      1     30      3    4
--------------------------------------------------------------------------------
OverallQual 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0      10    0.95   6.099       4       5       5       6       7 
    .90     .95 
      8       8 

          1 2  3   4   5   6   7   8  9 10
Frequency 2 3 20 116 397 374 319 168 43 18
%         0 0  1   8  27  26  22  12  3  1
--------------------------------------------------------------------------------
OverallCond 
      n missing  unique    Info    Mean 
   1460       0       9    0.81   5.575 

          1 2  3  4   5   6   7  8  9
Frequency 1 5 25 57 821 252 205 72 22
%         0 0  2  4  56  17  14  5  2
--------------------------------------------------------------------------------
YearBuilt 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     112       1    1971    1916    1925    1954    1973    2000 
    .90     .95 
   2006    2007 

lowest : 1872 1875 1880 1882 1885, highest: 2006 2007 2008 2009 2010 
--------------------------------------------------------------------------------
YearRemodAdd 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0      61       1    1985    1950    1950    1967    1994    2004 
    .90     .95 
   2006    2007 

lowest : 1950 1951 1952 1953 1954, highest: 2006 2007 2008 2009 2010 
--------------------------------------------------------------------------------
RoofStyle 
      n missing  unique 
   1460       0       6 

          Flat Gable Gambrel Hip Mansard Shed
Frequency   13  1141      11 286       7    2
%            1    78       1  20       0    0
--------------------------------------------------------------------------------
RoofMatl 
      n missing  unique 
   1460       0       8 

          ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
Frequency       1    1434       1     1    1      11       5       6
%               0      98       0     0    0       1       0       0
--------------------------------------------------------------------------------
Exterior1st 
      n missing  unique 
   1460       0      15 

          AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc
Frequency      20       1       2      50      1      61     222       1
%               1       0       0       3      0       4      15       0
          MetalSd Plywood Stone Stucco VinylSd Wd Sdng WdShing
Frequency     220     108     2     25     515     206      26
%              15       7     0      2      35      14       2
--------------------------------------------------------------------------------
Exterior2nd 
      n missing  unique 
   1460       0      16 

          AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc
Frequency      20       3       7      25      1      60     207      10
%               1       0       0       2      0       4      14       1
          MetalSd Other Plywood Stone Stucco VinylSd Wd Sdng Wd Shng
Frequency     214     1     142     5     26     504     197      38
%              15     0      10     0      2      35      13       3
--------------------------------------------------------------------------------
MasVnrType 
      n missing  unique 
   1452       8       4 

BrkCmn (15, 1%), BrkFace (445, 31%), None (864, 60%) 
Stone (128, 9%) 
--------------------------------------------------------------------------------
MasVnrArea 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1452       8     327    0.79   103.7       0       0       0       0     166 
    .90     .95 
    335     456 

lowest :    0    1   11   14   16, highest: 1115 1129 1170 1378 1600 
--------------------------------------------------------------------------------
ExterQual 
      n missing  unique 
   1460       0       4 

Ex (52, 4%), Fa (14, 1%), Gd (488, 33%), TA (906, 62%) 
--------------------------------------------------------------------------------
ExterCond 
      n missing  unique 
   1460       0       5 

          Ex Fa  Gd Po   TA
Frequency  3 28 146  1 1282
%          0  2  10  0   88
--------------------------------------------------------------------------------
Foundation 
      n missing  unique 
   1460       0       6 

          BrkTil CBlock PConc Slab Stone Wood
Frequency    146    634   647   24     6    3
%             10     43    44    2     0    0
--------------------------------------------------------------------------------
BsmtQual 
      n missing  unique 
   1423      37       4 

Ex (121, 9%), Fa (35, 2%), Gd (618, 43%), TA (649, 46%) 
--------------------------------------------------------------------------------
BsmtCond 
      n missing  unique 
   1423      37       4 

Fa (45, 3%), Gd (65, 5%), Po (2, 0%), TA (1311, 92%) 
--------------------------------------------------------------------------------
BsmtExposure 
      n missing  unique 
   1422      38       4 

Av (221, 16%), Gd (134, 9%), Mn (114, 8%), No (953, 67%) 
--------------------------------------------------------------------------------
BsmtFinType1 
      n missing  unique 
   1423      37       6 

          ALQ BLQ GLQ LwQ Rec Unf
Frequency 220 148 418  74 133 430
%          15  10  29   5   9  30
--------------------------------------------------------------------------------
BsmtFinSF1 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     637    0.97   443.6     0.0     0.0     0.0   383.5   712.2 
    .90     .95 
 1065.5  1274.0 

lowest :    0    2   16   20   24, highest: 1904 2096 2188 2260 5644 
--------------------------------------------------------------------------------
BsmtFinType2 
      n missing  unique 
   1422      38       6 

          ALQ BLQ GLQ LwQ Rec  Unf
Frequency  19  33  14  46  54 1256
%           1   2   1   3   4   88
--------------------------------------------------------------------------------
BsmtFinSF2 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     144    0.31   46.55     0.0     0.0     0.0     0.0     0.0 
    .90     .95 
  117.2   396.2 

lowest :    0   28   32   35   40, highest: 1080 1085 1120 1127 1474 
--------------------------------------------------------------------------------
BsmtUnfSF 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     780       1   567.2     0.0    74.9   223.0   477.5   808.0 
    .90     .95 
 1232.0  1468.0 

lowest :    0   14   15   23   26, highest: 2042 2046 2121 2153 2336 
--------------------------------------------------------------------------------
TotalBsmtSF 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     721       1    1057   519.3   636.9   795.8   991.5  1298.2 
    .90     .95 
 1602.2  1753.0 

lowest :    0  105  190  264  270, highest: 3094 3138 3200 3206 6110 
--------------------------------------------------------------------------------
Heating 
      n missing  unique 
   1460       0       6 

          Floor GasA GasW Grav OthW Wall
Frequency     1 1428   18    7    2    4
%             0   98    1    0    0    0
--------------------------------------------------------------------------------
HeatingQC 
      n missing  unique 
   1460       0       5 

           Ex Fa  Gd Po  TA
Frequency 741 49 241  1 428
%          51  3  17  0  29
--------------------------------------------------------------------------------
CentralAir 
      n missing  unique 
   1460       0       2 

N (95, 7%), Y (1365, 93%) 
--------------------------------------------------------------------------------
Electrical 
      n missing  unique 
   1459       1       5 

          FuseA FuseF FuseP Mix SBrkr
Frequency    94    27     3   1  1334
%             6     2     0   0    91
--------------------------------------------------------------------------------
1stFlrSF 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     753       1    1163   673.0   756.9   882.0  1087.0  1391.2 
    .90     .95 
 1680.0  1831.2 

lowest :  334  372  438  480  483, highest: 2633 2898 3138 3228 4692 
--------------------------------------------------------------------------------
2ndFlrSF 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     417    0.82     347     0.0     0.0     0.0     0.0   728.0 
    .90     .95 
  954.2  1141.0 

lowest :    0  110  167  192  208, highest: 1611 1796 1818 1872 2065 
--------------------------------------------------------------------------------
LowQualFinSF 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0      24    0.05   5.845       0       0       0       0       0 
    .90     .95 
      0       0 

lowest :   0  53  80 120 144, highest: 513 514 515 528 572 
--------------------------------------------------------------------------------
GrLivArea 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     861       1    1515     848     912    1130    1464    1777 
    .90     .95 
   2158    2466 

lowest :  334  438  480  520  605, highest: 3627 4316 4476 4676 5642 
--------------------------------------------------------------------------------
BsmtFullBath 
      n missing  unique    Info    Mean 
   1460       0       4    0.73  0.4253 

0 (856, 59%), 1 (588, 40%), 2 (15, 1%), 3 (1, 0%) 
--------------------------------------------------------------------------------
BsmtHalfBath 
      n missing  unique    Info    Mean 
   1460       0       3    0.16 0.05753 

0 (1378, 94%), 1 (80, 5%), 2 (2, 0%) 
--------------------------------------------------------------------------------
FullBath 
      n missing  unique    Info    Mean 
   1460       0       4    0.77   1.565 

0 (9, 1%), 1 (650, 45%), 2 (768, 53%), 3 (33, 2%) 
--------------------------------------------------------------------------------
HalfBath 
      n missing  unique    Info    Mean 
   1460       0       3    0.71  0.3829 

0 (913, 63%), 1 (535, 37%), 2 (12, 1%) 
--------------------------------------------------------------------------------
BedroomAbvGr 
      n missing  unique    Info    Mean 
   1460       0       8    0.82   2.866 

          0  1   2   3   4  5 6 8
Frequency 6 50 358 804 213 21 7 1
%         0  3  25  55  15  1 0 0
--------------------------------------------------------------------------------
KitchenAbvGr 
      n missing  unique    Info    Mean 
   1460       0       4    0.13   1.047 

0 (1, 0%), 1 (1392, 95%), 2 (65, 4%), 3 (2, 0%) 
--------------------------------------------------------------------------------
KitchenQual 
      n missing  unique 
   1460       0       4 

Ex (100, 7%), Fa (39, 3%), Gd (586, 40%), TA (735, 50%) 
--------------------------------------------------------------------------------
TotRmsAbvGrd 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0      12    0.96   6.518       4       5       5       6       7 
    .90     .95 
      9      10 

          2  3  4   5   6   7   8  9 10 11 12 14
Frequency 1 17 97 275 402 329 187 75 47 18 11  1
%         0  1  7  19  28  23  13  5  3  1  1  0
--------------------------------------------------------------------------------
Functional 
      n missing  unique 
   1460       0       7 

          Maj1 Maj2 Min1 Min2 Mod Sev  Typ
Frequency   14    5   31   34  15   1 1360
%            1    0    2    2   1   0   93
--------------------------------------------------------------------------------
Fireplaces 
      n missing  unique    Info    Mean 
   1460       0       4    0.81   0.613 

0 (690, 47%), 1 (650, 45%), 2 (115, 8%), 3 (5, 0%) 
--------------------------------------------------------------------------------
FireplaceQu 
      n missing  unique 
    770     690       5 

          Ex Fa  Gd Po  TA
Frequency 24 33 380 20 313
%          3  4  49  3  41
--------------------------------------------------------------------------------
GarageType 
      n missing  unique 
   1379      81       6 

          2Types Attchd Basment BuiltIn CarPort Detchd
Frequency      6    870      19      88       9    387
%              0     63       1       6       1     28
--------------------------------------------------------------------------------
GarageYrBlt 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1379      81      97       1    1979    1930    1945    1961    1980    2002 
    .90     .95 
   2006    2007 

lowest : 1900 1906 1908 1910 1914, highest: 2006 2007 2008 2009 2010 
--------------------------------------------------------------------------------
GarageFinish 
      n missing  unique 
   1379      81       3 

Fin (352, 26%), RFn (422, 31%), Unf (605, 44%) 
--------------------------------------------------------------------------------
GarageCars 
      n missing  unique    Info    Mean 
   1460       0       5     0.8   1.767 

           0   1   2   3 4
Frequency 81 369 824 181 5
%          6  25  56  12 0
--------------------------------------------------------------------------------
GarageArea 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     441       1     473     0.0   240.0   334.5   480.0   576.0 
    .90     .95 
  757.1   850.1 

lowest :    0  160  164  180  186, highest: 1220 1248 1356 1390 1418 
--------------------------------------------------------------------------------
GarageQual 
      n missing  unique 
   1379      81       5 

          Ex Fa Gd Po   TA
Frequency  3 48 14  3 1311
%          0  3  1  0   95
--------------------------------------------------------------------------------
GarageCond 
      n missing  unique 
   1379      81       5 

          Ex Fa Gd Po   TA
Frequency  2 35  9  7 1326
%          0  3  1  1   96
--------------------------------------------------------------------------------
PavedDrive 
      n missing  unique 
   1460       0       3 

N (90, 6%), P (30, 2%), Y (1340, 92%) 
--------------------------------------------------------------------------------
WoodDeckSF 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     274    0.86   94.24       0       0       0       0     168 
    .90     .95 
    262     335 

lowest :   0  12  24  26  28, highest: 668 670 728 736 857 
--------------------------------------------------------------------------------
OpenPorchSF 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     202    0.91   46.66       0       0       0      25      68 
    .90     .95 
    130     175 

lowest :   0   4   8  10  11, highest: 406 418 502 523 547 
--------------------------------------------------------------------------------
EnclosedPorch 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     120    0.37   21.95     0.0     0.0     0.0     0.0     0.0 
    .90     .95 
  112.0   180.1 

lowest :   0  19  20  24  30, highest: 301 318 330 386 552 
--------------------------------------------------------------------------------
3SsnPorch 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0      20    0.05    3.41       0       0       0       0       0 
    .90     .95 
      0       0 

lowest :   0  23  96 130 140, highest: 290 304 320 407 508 
--------------------------------------------------------------------------------
ScreenPorch 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0      76    0.22   15.06       0       0       0       0       0 
    .90     .95 
      0     160 

lowest :   0  40  53  60  63, highest: 385 396 410 440 480 
--------------------------------------------------------------------------------
PoolArea 
      n missing  unique    Info    Mean 
   1460       0       8    0.01   2.759 

             0 480 512 519 555 576 648 738
Frequency 1453   1   1   1   1   1   1   1
%          100   0   0   0   0   0   0   0
--------------------------------------------------------------------------------
PoolQC 
      n missing  unique 
      7    1453       3 

Ex (2, 29%), Fa (2, 29%), Gd (3, 43%) 
--------------------------------------------------------------------------------
Fence 
      n missing  unique 
    281    1179       4 

GdPrv (59, 21%), GdWo (54, 19%), MnPrv (157, 56%) 
MnWw (11, 4%) 
--------------------------------------------------------------------------------
MiscFeature 
      n missing  unique 
     54    1406       4 

Gar2 (2, 4%), Othr (2, 4%), Shed (49, 91%), TenC (1, 2%) 
--------------------------------------------------------------------------------
MiscVal 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0      21     0.1   43.49       0       0       0       0       0 
    .90     .95 
      0       0 

lowest :     0    54   350   400   450, highest:  2000  2500  3500  8300 15500 
--------------------------------------------------------------------------------
MoSold 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0      12    0.99   6.322       2       3       5       6       8 
    .90     .95 
     10      11 

           1  2   3   4   5   6   7   8  9 10 11 12
Frequency 58 52 106 141 204 253 234 122 63 89 79 59
%          4  4   7  10  14  17  16   8  4  6  5  4
--------------------------------------------------------------------------------
YrSold 
      n missing  unique    Info    Mean 
   1460       0       5    0.96    2008 

          2006 2007 2008 2009 2010
Frequency  314  329  304  338  175
%           22   23   21   23   12
--------------------------------------------------------------------------------
SaleType 
      n missing  unique 
   1460       0       9 

          COD Con ConLD ConLI ConLw CWD New Oth   WD
Frequency  43   2     9     5     5   4 122   3 1267
%           3   0     1     0     0   0   8   0   87
--------------------------------------------------------------------------------
SaleCondition 
      n missing  unique 
   1460       0       6 

          Abnorml AdjLand Alloca Family Normal Partial
Frequency     101       4     12     20   1198     125
%               7       0      1      1     82       9
--------------------------------------------------------------------------------
SalePrice 
      n missing  unique    Info    Mean     .05     .10     .25     .50     .75 
   1460       0     663       1  180921   88000  106475  129975  163000  214000 
    .90     .95 
 278000  326100 

lowest :  34900  35311  37900  39300  40000
highest: 582933 611657 625000 745000 755000 
--------------------------------------------------------------------------------

In [114]:
psych::describe(dataset, check = T, skew = TRUE, ranges = TRUE, quant = TRUE)



In [115]:
fnMissingDataPercent(data = dataset)


PoolQC         99.52
MiscFeature    96.30
Alley          93.77
Fence          80.75
FireplaceQu    47.26
LotFrontage    17.74
GarageType      5.55
GarageYrBlt     5.55
GarageFinish    5.55
GarageQual      5.55
GarageCond      5.55
BsmtExposure    2.60
BsmtFinType2    2.60
BsmtQual        2.53
BsmtCond        2.53
BsmtFinType1    2.53
MasVnrType      0.55
MasVnrArea      0.55
Electrical      0.07

All remaining variables have 0% missing: Id, MSSubClass, MSZoning, LotArea, Street, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, ExterQual, ExterCond, Foundation, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, Heating, HeatingQC, CentralAir, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd, Functional, Fireplaces, GarageCars, GarageArea, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, MiscVal, MoSold, YrSold, SaleType, SaleCondition, SalePrice.
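fnMissingDataPercent comes from the helper file sourced above; a rough base-R equivalent (my sketch, not the author's implementation) is:

# percent of missing values per column, sorted descending
miss_pct <- sort(colMeans(is.na(dataset)) * 100, decreasing = TRUE)
round(miss_pct, 2)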

In [116]:
# combine test and train.
dataset <- dplyr::bind_rows(dataset, test)

In [117]:
# quickly replace NAs. if numeric, replace with -1, if character replace with 'unknown'
# this gets rid of all NAs
dataset <- dataset %>% mutate_if(is.numeric, funs(ifelse(is.na(.), -1, .)))
dataset <- dataset %>% mutate_if(is.character, funs(ifelse(is.na(.), 'unknown', .)))
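As a quick illustration of what those two lines do, here is the same pattern on a made-up two-column tibble:

library(dplyr)
toy <- tibble::tibble(sf = c(100, NA), qual = c("Gd", NA))
toy %>%
  mutate_if(is.numeric,   funs(ifelse(is.na(.), -1, .))) %>%
  mutate_if(is.character, funs(ifelse(is.na(.), 'unknown', .)))
# sf becomes 100, -1; qual becomes "Gd", "unknown"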

In [118]:
# create dummy variables with caret on the combined data; if train and test were
# encoded separately, mismatched factor levels would produce inconsistent columns
dmy <- caret::dummyVars(" ~ .", data = dataset, fullRank = T)
dataset <- as_tibble(predict(dmy, newdata = dataset))

#make the names usable in R
names(dataset) <- make.names(names(dataset), unique = TRUE)

dim(dataset)
head(test)
str(dataset)


  1. 2919
  2. 270
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
1461 20 RH 80 11622 Pave NA Reg Lvl AllPub ... 120 0 NA MnPrv NA 0 6 2010 WD Normal
1462 20 RL 81 14267 Pave NA IR1 Lvl AllPub ... 0 0 NA NA Gar2 12500 6 2010 WD Normal
1463 60 RL 74 13830 Pave NA IR1 Lvl AllPub ... 0 0 NA MnPrv NA 0 3 2010 WD Normal
1464 60 RL 78 9978 Pave NA IR1 Lvl AllPub ... 0 0 NA NA NA 0 6 2010 WD Normal
1465 120 RL 43 5005 Pave NA IR1 HLS AllPub ... 144 0 NA NA NA 0 1 2010 WD Normal
1466 60 RL 75 10000 Pave NA IR1 Lvl AllPub ... 0 0 NA NA NA 0 4 2010 WD Normal
Classes 'tbl_df', 'tbl' and 'data.frame':	2919 obs. of  270 variables:
 $ Id                  : num  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass          : num  60 20 60 70 60 50 20 60 50 190 ...
 $ MSZoningFV          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MSZoningRH          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ MSZoningRL          : num  1 1 1 1 1 1 1 1 0 1 ...
 $ MSZoningRM          : num  0 0 0 0 0 0 0 0 1 0 ...
 $ MSZoningunknown     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LotFrontage         : num  65 80 68 60 84 85 75 -1 51 50 ...
 $ LotArea             : num  8450 9600 11250 9550 14260 ...
 $ StreetPave          : num  1 1 1 1 1 1 1 1 1 1 ...
 $ AlleyPave           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Alleyunknown        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ LotShapeIR2         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LotShapeIR3         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LotShapeReg         : num  1 1 0 0 0 0 1 0 1 1 ...
 $ LandContourHLS      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LandContourLow      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LandContourLvl      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ UtilitiesNoSeWa     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Utilitiesunknown    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LotConfigCulDSac    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LotConfigFR2        : num  0 1 0 0 1 0 0 0 0 0 ...
 $ LotConfigFR3        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LotConfigInside     : num  1 0 1 0 0 1 1 0 1 0 ...
 $ LandSlopeMod        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LandSlopeSev        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodBlueste : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodBrDale  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodBrkSide : num  0 0 0 0 0 0 0 0 0 1 ...
 $ NeighborhoodClearCr : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodCollgCr : num  1 0 1 0 0 0 0 0 0 0 ...
 $ NeighborhoodCrawfor : num  0 0 0 1 0 0 0 0 0 0 ...
 $ NeighborhoodEdwards : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodGilbert : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodIDOTRR  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodMeadowV : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodMitchel : num  0 0 0 0 0 1 0 0 0 0 ...
 $ NeighborhoodNAmes   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodNoRidge : num  0 0 0 0 1 0 0 0 0 0 ...
 $ NeighborhoodNPkVill : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodNridgHt : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodNWAmes  : num  0 0 0 0 0 0 0 1 0 0 ...
 $ NeighborhoodOldTown : num  0 0 0 0 0 0 0 0 1 0 ...
 $ NeighborhoodSawyer  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodSawyerW : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodSomerst : num  0 0 0 0 0 0 1 0 0 0 ...
 $ NeighborhoodStoneBr : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodSWISU   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodTimber  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ NeighborhoodVeenker : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Condition1Feedr     : num  0 1 0 0 0 0 0 0 0 0 ...
 $ Condition1Norm      : num  1 0 1 1 1 1 1 0 0 0 ...
 $ Condition1PosA      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Condition1PosN      : num  0 0 0 0 0 0 0 1 0 0 ...
 $ Condition1RRAe      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Condition1RRAn      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Condition1RRNe      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Condition1RRNn      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Condition2Feedr     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Condition2Norm      : num  1 1 1 1 1 1 1 1 1 0 ...
 $ Condition2PosA      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Condition2PosN      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Condition2RRAe      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Condition2RRAn      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Condition2RRNn      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ BldgType2fmCon      : num  0 0 0 0 0 0 0 0 0 1 ...
 $ BldgTypeDuplex      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ BldgTypeTwnhs       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ BldgTypeTwnhsE      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ HouseStyle1.5Unf    : num  0 0 0 0 0 0 0 0 0 1 ...
 $ HouseStyle1Story    : num  0 1 0 0 0 0 1 0 0 0 ...
 $ HouseStyle2.5Fin    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ HouseStyle2.5Unf    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ HouseStyle2Story    : num  1 0 1 1 1 0 0 1 0 0 ...
 $ HouseStyleSFoyer    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ HouseStyleSLvl      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ OverallQual         : num  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond         : num  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt           : num  2003 1976 2001 1915 2000 ...
 $ YearRemodAdd        : num  2003 1976 2002 1970 2000 ...
 $ RoofStyleGable      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ RoofStyleGambrel    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ RoofStyleHip        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ RoofStyleMansard    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ RoofStyleShed       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ RoofMatlCompShg     : num  1 1 1 1 1 1 1 1 1 1 ...
 $ RoofMatlMembran     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ RoofMatlMetal       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ RoofMatlRoll        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ RoofMatlTar.Grv     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ RoofMatlWdShake     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ RoofMatlWdShngl     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Exterior1stAsphShn  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Exterior1stBrkComm  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Exterior1stBrkFace  : num  0 0 0 0 0 0 0 0 1 0 ...
 $ Exterior1stCBlock   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Exterior1stCemntBd  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Exterior1stHdBoard  : num  0 0 0 0 0 0 0 1 0 0 ...
 $ Exterior1stImStucc  : num  0 0 0 0 0 0 0 0 0 0 ...
  [list output truncated]
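As a side note, here's what dummyVars with fullRank = T does on a toy data frame; a minimal illustrative sketch, separate from the pipeline above:

library(caret)
toy <- data.frame(zone = factor(c("RL", "RM", "FV")), area = c(8450, 9600, 11250))
dmy_toy <- dummyVars(~ ., data = toy, fullRank = TRUE)
predict(dmy_toy, newdata = toy)
# a factor with k levels becomes k-1 indicator columns; the first level is the baseline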

In [119]:
# split back out the dataset and test
test <- dplyr::filter(dataset, Id > 1460) 
dataset <- dplyr::filter(dataset, Id <= 1460)


# drop the test set's SalePrice column, which is entirely NA
test$SalePrice <- NULL

dim(dataset)
dim(test)


  1. 1460
  2. 270
  1. 1459
  2. 269

In [120]:
#set aside a final full set of data to train on before splitting a validation set
final_dataset <- dataset

# split a validation dataset
validation_index <- createDataPartition(dataset$SalePrice, p=0.80, list=FALSE)
validation <- dataset[-validation_index,]
dataset <- dataset[validation_index,]

In [121]:
# take a 100-row slice of the training data for quick algorithm spot-checks
dataset_slice <- dplyr::slice(dataset, 1:100)

In [122]:
formula <- SalePrice ~ .

In [134]:
# Ensemble Methods
ds <- dataset_slice

# try ensembles
control <- trainControl(method="cv", number=10)
metric <- "RMSE"
# Random Forest
set.seed(9)
fit.rf <- train(formula, data=ds, method="rf", preProc=c("medianImpute"), metric=metric, trControl=control, na.action = na.pass)
# Stochastic Gradient Boosting
set.seed(9)
fit.gbm <- train(formula, data=ds, method="gbm", preProc=c("medianImpute"),metric=metric, trControl=control, verbose=FALSE, na.action = na.pass)
# Cubist
set.seed(9)
fit.cubist <- train(formula, data=ds, method="cubist", preProc=c("medianImpute"),metric=metric, trControl=control, na.action = na.pass)
# xgb
set.seed(9)
fit.xgb <- train(formula, data=ds, method="xgbTree", preProc=c("medianImpute"),metric=metric, trControl=control, na.action = na.pass)
# Compare algorithms
ensemble_results <- resamples(list(RF=fit.rf, GBM=fit.gbm, CUBIST=fit.cubist, XGB=fit.xgb))
summary(ensemble_results)
bwplot(ensemble_results)


Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:Hmisc':

    combine

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin

Loading required package: gbm
Loading required package: splines
Loading required package: parallel
Loaded gbm 2.1.3
Loading required package: plyr
------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------

Attaching package: 'plyr'

The following objects are masked from 'package:Hmisc':

    is.discrete, summarize

The following object is masked from 'package:DMwR':

    join

The following object is masked from 'package:lubridate':

    here

The following objects are masked from 'package:dplyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following object is masked from 'package:purrr':

    compact

Loading required package: xgboost

Attaching package: 'xgboost'

The following object is masked from 'package:dplyr':

    slice

Call:
summary.resamples(object = ensemble_results)

Models: RF, GBM, CUBIST, XGB 
Number of resamples: 10 

RMSE 
        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
RF     18450   22630  26890 30800   37920 51920    0
GBM    15690   20380  29320 28540   34470 43160    0
CUBIST  7971   20200  25640 26980   32980 45570    0
XGB    18980   22860  29200 29600   34930 42660    0

Rsquared 
         Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
RF     0.4583  0.8078 0.8295 0.8247  0.9021 0.9635    0
GBM    0.7479  0.8144 0.8547 0.8485  0.8946 0.9578    0
CUBIST 0.5044  0.7990 0.8939 0.8412  0.9171 0.9926    0
XGB    0.4435  0.7921 0.8593 0.8216  0.9098 0.9638    0

In [135]:
fit.cubist
plot(fit.cubist)


Cubist 

100 samples
269 predictors

Pre-processing: median imputation (269) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 88, 89, 89, 90, 91, 91, ... 
Resampling results across tuning parameters:

  committees  neighbors  RMSE      Rsquared 
   1          0          32654.50  0.7994853
   1          5          28745.98  0.8031904
   1          9          29653.37  0.8166557
  10          0          27775.61  0.8338888
  10          5          27051.77  0.8335674
  10          9          26983.68  0.8411926
  20          0          27617.23  0.8388381
  20          5          27584.39  0.8300897
  20          9          27002.19  0.8466963

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were committees = 10 and neighbors = 9.

Cubist performed best. I'll first use a random search to find promising tuning values, then grid-search around them.


In [136]:
# Tune the Cubist algorithm
ds <- dplyr::slice(dataset, 1:1000)

control <- trainControl(method="cv", number=10, search='random')
metric <- "RMSE"
set.seed(7)
rand.cubist <- train(formula, data=ds, method="cubist", metric=metric, trControl=control, tuneLength = 20)
print(rand.cubist)
plot(rand.cubist)


Cubist 

1000 samples
 269 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 898, 900, 900, 901, 901, 900, ... 
Resampling results across tuning parameters:

  committees  neighbors  RMSE      Rsquared 
    2         4          26897.35  0.8823320
    4         1          30915.70  0.8508779
   15         7          26499.00  0.8851216
   24         1          30823.87  0.8511905
   25         2          28459.79  0.8710751
   25         7          26633.79  0.8842461
   26         0          26379.56  0.8857581
   28         2          28442.14  0.8711172
   36         7          26751.72  0.8834508
   48         9          26570.75  0.8849160
   56         7          26735.78  0.8837600
   59         1          30848.77  0.8512799
   62         1          30872.81  0.8509946
   77         8          26361.98  0.8862668
   79         3          27564.60  0.8769157
   81         3          27513.80  0.8774290
   81         9          26315.59  0.8866018
   91         1          30726.54  0.8519100
  100         3          27540.22  0.8770529

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were committees = 81 and neighbors = 9.

In [142]:
# Tune the Cubist algorithm
control <- trainControl(method="cv", number=5)
metric <- "RMSE"
set.seed(7)
grid <- expand.grid(.committees=seq(5, 40, by=5), .neighbors=8)
tune.cubist <- train(formula, data=dataset, method="cubist", preProc=c("zv","medianImpute","BoxCox"),metric=metric, tuneGrid=grid, trControl=control, na.action = na.pass)
print(tune.cubist)
plot(tune.cubist)


Cubist 

1169 samples
 269 predictor

Pre-processing: median imputation (255), Box-Cox transformation (12),
 remove (14) 
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 935, 936, 935, 936, 934 
Resampling results across tuning parameters:

  committees  RMSE      Rsquared 
   5          36290.12  0.8012913
  10          35885.65  0.8033261
  15          35671.87  0.8049661
  20          35910.01  0.8028263
  25          35759.14  0.8046647
  30          35524.56  0.8067593
  35          35608.46  0.8061696
  40          35715.91  0.8049510

Tuning parameter 'neighbors' was held constant at a value of 8
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were committees = 30 and neighbors = 8.

In [137]:
set.seed(13)
predictions <- predict(rand.cubist, newdata=validation, na.action=na.pass)

MLmetrics::RMSE(predictions, validation$SalePrice)
MLmetrics::RMSLE(predictions, validation$SalePrice)


36266.6350361944
0.143377107091807
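Kaggle scores this competition on log-scale error; MLmetrics::RMSLE is simply the RMSE of log1p-transformed values, so the second number above could be reproduced by hand as:

# RMSLE by hand: RMSE of log(1 + y)
sqrt(mean((log1p(predictions) - log1p(validation$SalePrice))^2))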

In [138]:
# make predictions on the test set for Kaggle submission
test$prediction <- predict(rand.cubist, newdata = test, na.action = na.pass)
head(test)
nrow(data.frame(test))


Id MSSubClass MSZoningFV MSZoningRH MSZoningRL MSZoningRM MSZoningunknown LotFrontage LotArea StreetPave ... SaleTypeNew SaleTypeOth SaleTypeunknown SaleTypeWD SaleConditionAdjLand SaleConditionAlloca SaleConditionFamily SaleConditionNormal SaleConditionPartial prediction
1461 20 0 1 0 0 0 80 11622 1 ... 0 0 0 1 0 0 0 1 0 129184.5
1462 20 0 0 1 0 0 81 14267 1 ... 0 0 0 1 0 0 0 1 0 162046.9
1463 60 0 0 1 0 0 74 13830 1 ... 0 0 0 1 0 0 0 1 0 182289.0
1464 60 0 0 1 0 0 78 9978 1 ... 0 0 0 1 0 0 0 1 0 191930.6
1465 120 0 0 1 0 0 43 5005 1 ... 0 0 0 1 0 0 0 1 0 182107.4
1466 60 0 0 1 0 0 75 10000 1 ... 0 0 0 1 0 0 0 1 0 177914.6
1459

In [139]:
my_solution <- dplyr::select(test, Id = Id, SalePrice = prediction)
my_solution$Id <- as.character(my_solution$Id)
readr::write_csv(x = data.frame(my_solution), path = "C:\\Work\\my_solution.csv")

head(my_solution, n=5)
tail(my_solution, n=5)


Id SalePrice
1461 129184.5
1462 162046.9
1463 182289.0
1464 191930.6
1465 182107.4
Id SalePrice
2915 94747.53
2916 81349.26
2917 175689.98
2918 116572.81
2919 228319.30

In [41]:
# correlation between results 
modelCor(ensemble_results) 
splom(ensemble_results)


        RF        GBM       CUBIST    XGB
RF      1.0000000 0.9727554 0.8466115 0.9351958
GBM     0.9727554 1.0000000 0.8989875 0.8833198
CUBIST  0.8466115 0.8989875 1.0000000 0.6968309
XGB     0.9351958 0.8833198 0.6968309 1.0000000

Cubist was the most accurate spot-checked model, with the lowest mean RMSE. I'll tune Cubist and see if I can get more out of it.

Cubist has two parameters that are tunable with caret: committees, the number of boosting iterations, and neighbors, the number of nearby training instances used at prediction time to adjust the rule-based prediction (although the documentation is perhaps a little ambiguous on this).
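Outside of caret, the same two knobs appear directly in the Cubist package; a minimal sketch, where x_train, y_train, and x_new are placeholder objects:

library(Cubist)
# committees is fixed at fit time; neighbors is chosen at prediction time
fit  <- cubist(x = x_train, y = y_train, committees = 10)
pred <- predict(fit, newdata = x_new, neighbors = 9)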

For more information about Cubist, see the function help (?cubist). Let's first look at the default tuning parameters caret used for our most accurate model.


In [42]:
print(fit.cubist)


Cubist 

897 samples
 37 predictor

Pre-processing: median imputation (37) 
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 1052, 1052, 1052, 1053, 1053, 1052, ... 
Resampling results across tuning parameters:

  committees  neighbors  RMSE      Rsquared 
   1          0          31444.04  0.8407646
   1          5          31227.82  0.8468689
   1          9          30726.75  0.8497293
  10          0          30193.08  0.8481363
  10          5          29988.91  0.8525039
  10          9          29587.85  0.8547257
  20          0          30087.44  0.8477844
  20          5          29773.66  0.8527302
  20          9          29449.11  0.8542496

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were committees = 20 and neighbors = 9.

We can see that the best RMSE in that run was achieved with committees = 20 and neighbors = 9. Let's use a grid search to tune further. The random search further below pointed toward much larger committee counts, so I'll try every committees value between 90 and 105 with neighbors of 8 and 9. (Cubist supports at most 100 committees, so values above 100 come back as NaN.)


In [50]:
# Tune the Cubist algorithm
control <- trainControl(method="cv", number=10)
metric <- "RMSE"
set.seed(7)
grid <- expand.grid(.committees=seq(90, 105, by=1), .neighbors=seq(8,9, by=1))
tune.cubist <- train(formula, data=dataset, method="cubist", preProc=c("medianImpute"),metric=metric, tuneGrid=grid, trControl=control, na.action = na.pass)
print(tune.cubist)
plot(tune.cubist)


Cubist 

897 samples
 37 predictor

Pre-processing: median imputation (37) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1053, 1052, 1053, 1052, 1053, 1051, ... 
Resampling results across tuning parameters:

  committees  neighbors  RMSE      Rsquared 
   90         8          30630.31  0.8450439
   90         9          30588.15  0.8453062
   91         8          30567.88  0.8454597
   91         9          30525.36  0.8457290
   92         8          30617.43  0.8452248
   92         9          30575.54  0.8454881
   93         8          30522.99  0.8458538
   93         9          30479.80  0.8461305
   94         8          30544.03  0.8458157
   94         9          30501.19  0.8460874
   95         8          30508.47  0.8459756
   95         9          30465.05  0.8462542
   96         8          30490.35  0.8461352
   96         9          30447.80  0.8464053
   97         8          30472.73  0.8462456
   97         9          30431.11  0.8465093
   98         8          30502.67  0.8461257
   98         9          30461.84  0.8463816
   99         8          30484.91  0.8462641
   99         9          30443.26  0.8465282
  100         8          30527.75  0.8457794
  100         9          30486.82  0.8460370
  101         8               NaN        NaN
  101         9               NaN        NaN
  102         8               NaN        NaN
  102         9               NaN        NaN
  103         8               NaN        NaN
  103         9               NaN        NaN
  104         8               NaN        NaN
  104         9               NaN        NaN
  105         8               NaN        NaN
  105         9               NaN        NaN

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were committees = 97 and neighbors = 9.

In [45]:
# Tune the Cubist algorithm
control <- trainControl(method="cv", number=10, search='random')
metric <- "RMSE"
set.seed(13)
rand.cubist <- train(formula, data=dataset, method="cubist", preProc=c("medianImpute"), metric=metric, trControl=control, tuneLength = 20, na.action = na.pass)



In [46]:
print(rand.cubist)
plot(rand.cubist)


Cubist 

897 samples
 37 predictor

Pre-processing: median imputation (37) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1051, 1053, 1052, 1052, 1053, 1052, ... 
Resampling results across tuning parameters:

  committees  neighbors  RMSE      Rsquared 
   3          9          31465.29  0.8365468
   4          6          32426.92  0.8268674
   6          6          31309.40  0.8348669
  17          1          35053.92  0.8097407
  24          8          31981.41  0.8300350
  29          0          32521.91  0.8241532
  30          8          31671.73  0.8322066
  32          1          34538.31  0.8137425
  49          5          31173.86  0.8357425
  55          0          31688.43  0.8298428
  59          6          31095.27  0.8364246
  73          8          30681.01  0.8393753
  76          6          30856.26  0.8384297
  79          4          31052.81  0.8363843
  82          7          30686.00  0.8394768
  88          5          30883.87  0.8382161
  89          8          30472.20  0.8411396
  91          0          31233.98  0.8337150
  95          8          30364.91  0.8420689
  96          0          31108.85  0.8349099

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were committees = 95 and neighbors = 8.

In [47]:
set.seed(13)
predictions <- predict(rand.cubist, newdata=validation, na.action=na.pass)

MLmetrics::RMSE(predictions, validation$SalePrice)
MLmetrics::RMSLE(predictions, validation$SalePrice)


28146.8998857966
0.117144239852521

In [48]:
# make predictions on the test set for Kaggle submission
test$prediction <- predict(rand.cubist, newdata = test, na.action = na.pass)
head(test)
nrow(data.frame(test))


Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition prediction
1461 20 RH 80 11622 Pave NA Reg Lvl AllPub ... 0 NA MnPrv NA 0 6 2010 WD Normal 126830.0
1462 20 RL 81 14267 Pave NA IR1 Lvl AllPub ... 0 NA NA Gar2 12500 6 2010 WD Normal 163595.7
1463 60 RL 74 13830 Pave NA IR1 Lvl AllPub ... 0 NA MnPrv NA 0 3 2010 WD Normal 182957.2
1464 60 RL 78 9978 Pave NA IR1 Lvl AllPub ... 0 NA NA NA 0 6 2010 WD Normal 193952.5
1465 120 RL 43 5005 Pave NA IR1 HLS AllPub ... 0 NA NA NA 0 1 2010 WD Normal 186825.6
1466 60 RL 75 10000 Pave NA IR1 Lvl AllPub ... 0 NA NA NA 0 4 2010 WD Normal 173457.3
1459

In [49]:
my_solution <- dplyr::select(test, Id = Id, SalePrice = prediction)
readr::write_csv(x = data.frame(my_solution), path = "C:\\Work\\my_solution.csv")

head(my_solution, n=20)
tail(my_solution, n=20)


Id SalePrice
1461 126830.02
1462 163595.69
1463 182957.19
1464 193952.52
1465 186825.59
1466 173457.34
1467 182741.45
1468 163119.27
1469 186589.91
1470 118209.41
1471 206801.23
1472 92614.41
1473 90469.55
1474 150452.75
1475 103124.55
1476 332124.84
1477 247964.45
1478 284008.00
1479 258617.81
1480 500307.91
Id SalePrice
2900 165006.59
2901 210373.39
2902 196022.58
2903 335850.72
2904 351279.34
2905 84391.13
2906 221652.44
2907 106752.44
2908 130474.32
2909 138493.48
2910 79982.55
2911 78002.70
2912 148282.50
2913 84687.27
2914 71243.44
2915 83983.96
2916 82510.62
2917 182881.77
2918 109052.62
2919 228521.91
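Before uploading, a quick sanity check on the written file can catch formatting problems; this check is my addition, not part of the original notebook:

sub <- readr::read_csv("C:\\Work\\my_solution.csv")
stopifnot(nrow(sub) == 1459,
          identical(names(sub), c("Id", "SalePrice")),
          !any(is.na(sub$SalePrice)))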


In [56]:
# lm
set.seed(7)
fit.lm1 <- train(formula, data=dataset, method="lm", metric=metric, trControl=control)
# GLM
set.seed(7)
fit.glm1 <- train(formula, data=dataset, method="glm", metric=metric, trControl=control)
# GLMNET
set.seed(7)
fit.glmnet1 <- train(formula, data=dataset, method="glmnet", metric=metric, trControl=control)
# SVM
set.seed(7)
fit.svm1 <- train(formula, data=dataset, method="svmRadial", metric=metric, trControl=control)
# CART
set.seed(7)
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart1 <- train(formula, data=dataset, method="rpart", metric=metric, tuneGrid=grid, trControl=control)
# kNN
set.seed(7)
fit.knn1 <- train(formula, data=dataset, method="knn", metric=metric, trControl=control)


# Compare algorithms
results <- resamples(list(LM1=fit.lm1, GLM1=fit.glm1, GLMNET1=fit.glmnet1, SVM1=fit.svm1, CART1=fit.cart1, KNN1=fit.knn1))
summary(results)
dotplot(results)


Loading required package: glmnet
Loading required package: Matrix

Attaching package: 'Matrix'

The following object is masked from 'package:tidyr':

    expand

Loading required package: foreach

Attaching package: 'foreach'

The following objects are masked from 'package:purrr':

    accumulate, when

Loaded glmnet 2.0-10

Loading required package: rpart
Call:
summary.resamples(object = results)

Models: LM1, GLM1, GLMNET1, SVM1, CART1, KNN1 
Number of resamples: 30 

RMSE 
         Min. 1st Qu. Median  Mean 3rd Qu.   Max. NA's
LM1     3.514   4.056  4.773 4.963   5.529  9.448    0
GLM1    3.514   4.056  4.773 4.963   5.529  9.448    0
GLMNET1 3.498   4.030  4.767 4.957   5.517  9.480    0
SVM1    2.410   2.836  3.272 3.537   3.870  6.708    0
CART1   2.797   3.434  4.272 4.531   5.437  9.248    0
KNN1    4.751   6.221  6.738 6.946   7.840 10.400    0

Rsquared 
          Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
LM1     0.3169  0.6682 0.7428 0.7293  0.7984 0.8882    0
GLM1    0.3169  0.6682 0.7428 0.7293  0.7984 0.8882    0
GLMNET1 0.3127  0.6678 0.7436 0.7296  0.7987 0.8905    0
SVM1    0.6539  0.8153 0.8843 0.8559  0.9160 0.9533    0
CART1   0.3614  0.6733 0.8197 0.7686  0.8618 0.9026    0
KNN1    0.1971  0.4022 0.4728 0.4679  0.5339 0.6475    0

In [57]:
# correlation between results 
modelCor(results) 
splom(results)


        LM1       GLM1      GLMNET1   SVM1      CART1     KNN1
LM1     1.0000000 1.0000000 0.9999138 0.7649814 0.7234765 0.8243609
GLM1    1.0000000 1.0000000 0.9999138 0.7649814 0.7234765 0.8243609
GLMNET1 0.9999138 0.9999138 1.0000000 0.7614532 0.7232234 0.8271828
SVM1    0.7649814 0.7649814 0.7614532 1.0000000 0.5216863 0.4984776
CART1   0.7234765 0.7234765 0.7232234 0.5216863 1.0000000 0.5559818
KNN1    0.8243609 0.8243609 0.8271828 0.4984776 0.5559818 1.0000000

Are high correlations among features preventing a better prediction?


In [58]:
# remove correlated attributes
# find attributes that are highly corrected
set.seed(7)
cutoff <- 0.70
correlations <- cor(dataset[,1:13])
highlyCorrelated <- findCorrelation(correlations, cutoff=cutoff)
for (value in highlyCorrelated) {
    print(names(dataset)[value])
}
# create a new dataset without highly corrected features
dataset_features <- dataset[,-highlyCorrelated]
dim(dataset_features)


[1] "indus"
[1] "nox"
[1] "tax"
[1] "dis"
  1. 407
  2. 10

Four features were dropped for exceeding the 0.70 correlation cutoff. Now I'll run the baseline algorithms again on the reduced dataset.


In [59]:
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "RMSE"

# lm
set.seed(7)
fit.lm <- train(medv~., data=dataset_features, method="lm", metric=metric, preProc=c("center", "scale"), trControl=control)
# GLM
set.seed(7)
fit.glm <- train(medv~., data=dataset_features, method="glm", metric=metric, preProc=c("center", "scale"), trControl=control)
# GLMNET
set.seed(7)
fit.glmnet <- train(medv~., data=dataset_features, method="glmnet", metric=metric, preProc=c("center", "scale"), trControl=control)
# SVM
set.seed(7)
fit.svm <- train(medv~., data=dataset_features, method="svmRadial", metric=metric, preProc=c("center", "scale"), trControl=control)
# CART
set.seed(7)
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=dataset_features, method="rpart", metric=metric, tuneGrid=grid, preProc=c("center", "scale"), trControl=control)
# kNN
set.seed(7)
fit.knn <- train(medv~., data=dataset_features, method="knn", metric=metric, preProc=c("center", "scale"), trControl=control)


# Compare algorithms
feature_results <- resamples(list(LM=fit.lm, GLM=fit.glm, GLMNET=fit.glmnet, SVM=fit.svm, CART=fit.cart, KNN=fit.knn))
summary(feature_results)
dotplot(feature_results)


Call:
summary.resamples(object = feature_results)

Models: LM, GLM, GLMNET, SVM, CART, KNN 
Number of resamples: 10 

RMSE 
        Min. 1st Qu. Median  Mean 3rd Qu.  Max. NA's
LM     3.903   4.429  4.592 5.266   5.427 9.982    0
GLM    3.903   4.429  4.592 5.266   5.427 9.982    0
GLMNET 3.807   4.410  4.518 5.192   5.307 9.925    0
SVM    2.938   3.301  3.951 4.342   4.717 8.498    0
CART   2.661   3.293  3.952 4.389   4.435 9.558    0
KNN    2.722   3.416  4.195 4.463   4.785 8.889    0

Rsquared 
         Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
LM     0.2505  0.6819 0.7398 0.6995  0.8049 0.8877    0
GLM    0.2505  0.6819 0.7398 0.6995  0.8049 0.8877    0
GLMNET 0.2554  0.6929 0.7584 0.7083  0.8105 0.8901    0
SVM    0.4877  0.7096 0.8381 0.7853  0.8848 0.9116    0
CART   0.3310  0.7503 0.8239 0.7780  0.8636 0.9360    0
KNN    0.4105  0.7119 0.8158 0.7698  0.8758 0.9117    0

In [60]:
fit.svm


Support Vector Machines with Radial Basis Function Kernel 

407 samples
  9 predictor

Pre-processing: centered (9), scaled (9) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 366, 367, 366, 366, 367, 367, ... 
Resampling results across tuning parameters:

  C     RMSE      Rsquared 
  0.25  5.136846  0.7298906
  0.50  4.675332  0.7619518
  1.00  4.341606  0.7853084

Tuning parameter 'sigma' was held constant at a value of 0.1858149
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were sigma = 0.1858149 and C = 1.

Dropping the correlated predictors didn't clearly help the SVM here: its mean cross-validated RMSE rose from 3.54 to 4.34, though the two runs aren't strictly comparable (30 repeated resamples without pre-processing versus a single 10-fold pass with centring and scaling). SVM is still the strongest algorithm, so I'll tune it next, returning to the full feature set.

The C parameter is the cost constraint used by SVM; learn more in the help for the ksvm function (?ksvm). The previous results suggest that a C value of 1.0 is a good starting point and that RMSE may keep decreasing as C increases, so I'll try all integer C values between 1 and 15. The other parameter caret lets us tune is sigma, the bandwidth of the radial basis kernel. Good sigma values often start around 0.1, so I'll try values on either side of it.


In [73]:
# tune SVM sigma and C parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
set.seed(7)
grid <- expand.grid(.sigma=c(0.025, 0.05, 0.1, 0.15), .C=seq(1, 15, by=1))
fit.svm <- train(formula, data=dataset, method="svmRadial", metric=metric, tuneGrid=grid, preProc=c("BoxCox"), trControl=control)
print(fit.svm)
plot(fit.svm)


Support Vector Machines with Radial Basis Function Kernel 

407 samples
 13 predictor

Pre-processing: Box-Cox transformation (11) 
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 366, 367, 366, 366, 367, 367, ... 
Resampling results across tuning parameters:

  sigma  C   RMSE      Rsquared 
  0.025   1  3.889703  0.8335201
  0.025   2  3.685009  0.8470869
  0.025   3  3.562851  0.8553298
  0.025   4  3.453041  0.8628558
  0.025   5  3.372501  0.8686287
  0.025   6  3.306693  0.8731149
  0.025   7  3.261471  0.8761873
  0.025   8  3.232191  0.8780827
  0.025   9  3.208426  0.8797434
  0.025  10  3.186740  0.8812147
  0.025  11  3.169472  0.8824359
  0.025  12  3.155786  0.8835105
  0.025  13  3.145025  0.8843587
  0.025  14  3.132858  0.8851853
  0.025  15  3.120282  0.8860505
  0.050   1  3.771428  0.8438368
  0.050   2  3.484116  0.8634056
  0.050   3  3.282230  0.8768963
  0.050   4  3.179856  0.8829293
  0.050   5  3.105290  0.8873315
  0.050   6  3.054516  0.8907211
  0.050   7  3.024010  0.8925927
  0.050   8  3.003371  0.8936101
  0.050   9  2.984457  0.8944677
  0.050  10  2.977085  0.8948000
  0.050  11  2.968672  0.8953416
  0.050  12  2.962058  0.8957037
  0.050  13  2.955985  0.8959431
  0.050  14  2.951290  0.8961327
  0.050  15  2.947907  0.8962569
  0.100   1  3.762027  0.8453751
  0.100   2  3.300432  0.8747723
  0.100   3  3.142907  0.8825268
  0.100   4  3.071231  0.8862783
  0.100   5  3.028898  0.8890841
  0.100   6  3.015042  0.8900253
  0.100   7  3.009815  0.8904964
  0.100   8  3.005077  0.8909034
  0.100   9  3.006147  0.8908668
  0.100  10  3.006943  0.8908635
  0.100  11  3.005785  0.8910132
  0.100  12  3.005781  0.8911024
  0.100  13  3.006638  0.8911363
  0.100  14  3.008683  0.8911141
  0.100  15  3.010580  0.8910613
  0.150   1  3.835849  0.8408209
  0.150   2  3.318208  0.8716379
  0.150   3  3.171005  0.8793969
  0.150   4  3.151071  0.8809872
  0.150   5  3.149461  0.8811425
  0.150   6  3.154374  0.8807765
  0.150   7  3.156741  0.8806358
  0.150   8  3.157200  0.8806536
  0.150   9  3.156256  0.8807690
  0.150  10  3.156134  0.8807506
  0.150  11  3.156458  0.8807279
  0.150  12  3.156249  0.8807845
  0.150  13  3.155070  0.8809160
  0.150  14  3.154128  0.8810077
  0.150  15  3.154153  0.8809820

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were sigma = 0.05 and C = 15.

The RMSE curves flatten out as the C cost constraint increases. caret selected sigma = 0.05 with C = 15, giving an RMSE of 2.948, although everything from roughly C = 10 upward (RMSE 2.977) is nearly indistinguishable. If we wanted to take this further, we could run even finer grid searches, explore tuning other parameters of the underlying ksvm() function, or grid-search the other nonlinear regression methods.
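As a rough illustration, a finer follow-up search might look like the cell below; the grid values are my own assumptions centred on the winning combination above, not something that was run for this notebook.

In [ ]:
# hypothetical finer grid around sigma = 0.05 and C = 10..20 (illustrative values)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
set.seed(7)
fine_grid <- expand.grid(.sigma=seq(0.04, 0.06, by=0.005), .C=seq(10, 20, by=2))
fit.svm.fine <- train(formula, data=dataset, method="svmRadial", metric=metric,
                      tuneGrid=fine_grid, preProc=c("BoxCox"), trControl=control)
fit.svm.fine$bestTune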


In [74]:
set.seed(13)
predictions <- predict(fit.svm, newdata=validation, na.action=na.pass)

MLmetrics::RMSE(predictions, validation$medv)


2.85376796253756
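Since this notebook tracks R2 alongside RMSE, caret's postResample() is a convenient way to report both metrics in one call; an optional check on the same predictions:

In [ ]:
# report RMSE and R-squared together on the validation set
caret::postResample(pred = predictions, obs = validation$medv)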

In [63]:
glimpse(dataset_features)


Observations: 407
Variables: 10
$ crim    <dbl> 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.144...
$ zn      <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 0.0, 0.0, ...
$ chas    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ rm      <dbl> 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631, 5.8...
$ age     <dbl> 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 39.0, 61.8...
$ rad     <dbl> 2, 2, 3, 3, 3, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
$ ptratio <dbl> 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 21.0,...
$ b       <dbl> 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90, 386...
$ lstat   <dbl> 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 15.71, 8...
$ medv    <dbl> 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 21.7, 20.4,...

In [68]:
# apply the spatial sign transform: centre and scale each predictor,
# then project every row onto the unit sphere to blunt the effect of outliers
centerScale <- caret::preProcess(dataset_features[,1:9], method = c(
   "center","scale",
    "spatialSign" # data must be centred and scaled first
    ))
ssData <- predict(centerScale, newdata = dataset_features)

In [69]:
head(ssData)


        crim          zn       chas         rm        age        rad     ptratio         b       lstat medv
  -0.3057415 -0.33539714 -0.2050675  0.1414989  0.2536941 -0.6178320 -0.23181237 0.3182620 -0.36173413 21.6
  -0.1988610 -0.21814848 -0.1333797  0.5876024 -0.1266633 -0.4018493 -0.15077504 0.1854101 -0.56139321 34.7
  -0.1962334 -0.21557402 -0.1318056  0.4608017 -0.3729155 -0.3439066  0.04571398 0.1926592 -0.62350965 33.4
  -0.2102886 -0.23342609 -0.1427207  0.6023791 -0.2565155 -0.3723861  0.04949964 0.2215005 -0.51193445 36.2
  -0.2707984 -0.29727713 -0.1817603  0.1333720 -0.2261989 -0.4742481  0.06303969 0.2619902 -0.66240439 28.7
  -0.2326514  0.04737754 -0.1587674 -0.2062440 -0.0434955 -0.2860895 -0.85703048 0.2381948 -0.03013268 22.9
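Every row of the transformed predictor block should now lie on the unit sphere. A quick optional sanity check (assuming, as above, that the predictors are the first nine columns):

In [ ]:
# row norms of the spatial-sign-transformed predictors should all be 1
head(sqrt(rowSums(ssData[,1:9]^2)))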

In [71]:
# tune SVM sigma and C parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
set.seed(7)
grid <- expand.grid(.sigma=c(0.025, 0.05, 0.1, 0.15, 0.2, 0.25), .C=seq(1, 15, by=1))
fit.svm <- train(formula, data=ssData, method="svmRadial", metric=metric, tuneGrid=grid, preProc=c("YeoJohnson"), trControl=control)
print(fit.svm)
plot(fit.svm)


Support Vector Machines with Radial Basis Function Kernel 

407 samples
  9 predictor

Pre-processing: Yeo-Johnson transformation (9) 
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 366, 367, 366, 366, 367, 367, ... 
Resampling results across tuning parameters:

  sigma  C   RMSE      Rsquared 
  0.025   1  4.799036  0.7545641
  0.025   2  4.598553  0.7706094
  0.025   3  4.536308  0.7755918
  0.025   4  4.510645  0.7781409
  0.025   5  4.483821  0.7804615
  0.025   6  4.458433  0.7824321
  0.025   7  4.437082  0.7839830
  0.025   8  4.422740  0.7850595
  0.025   9  4.407503  0.7863258
  0.025  10  4.394818  0.7875119
  0.025  11  4.386376  0.7884606
  0.025  12  4.380142  0.7891152
  0.025  13  4.372431  0.7899239
  0.025  14  4.365381  0.7906252
  0.025  15  4.356737  0.7914507
  0.050   1  4.494527  0.7804428
  0.050   2  4.366527  0.7898645
  0.050   3  4.319540  0.7942297
  0.050   4  4.277365  0.7975284
  0.050   5  4.237597  0.8007294
  0.050   6  4.201159  0.8037178
  0.050   7  4.164792  0.8065783
  0.050   8  4.136943  0.8085899
  0.050   9  4.119564  0.8096662
  0.050  10  4.111661  0.8099835
  0.050  11  4.106955  0.8101161
  0.050  12  4.104801  0.8102737
  0.050  13  4.104199  0.8104075
  0.050  14  4.101161  0.8106189
  0.050  15  4.099134  0.8108185
  0.100   1  4.222472  0.8027630
  0.100   2  4.087170  0.8122060
  0.100   3  4.057028  0.8139461
  0.100   4  4.061125  0.8135993
  0.100   5  4.065650  0.8129651
  0.100   6  4.066467  0.8129084
  0.100   7  4.064548  0.8130995
  0.100   8  4.060870  0.8134430
  0.100   9  4.059575  0.8135766
  0.100  10  4.058182  0.8138386
  0.100  11  4.053929  0.8141974
  0.100  12  4.047732  0.8147264
  0.100  13  4.042849  0.8150944
  0.100  14  4.036432  0.8156718
  0.100  15  4.028118  0.8163635
  0.150   1  4.117026  0.8099370
  0.150   2  4.054519  0.8138563
  0.150   3  4.039556  0.8147772
  0.150   4  4.033269  0.8154278
  0.150   5  4.029878  0.8157925
  0.150   6  4.010496  0.8174739
  0.150   7  3.978291  0.8203540
  0.150   8  3.949712  0.8226881
  0.150   9  3.927998  0.8243310
  0.150  10  3.909650  0.8257058
  0.150  11  3.892513  0.8269691
  0.150  12  3.879805  0.8279124
  0.150  13  3.871650  0.8285214
  0.150  14  3.869265  0.8286673
  0.150  15  3.871681  0.8284644
  0.200   1  4.077672  0.8120301
  0.200   2  4.034397  0.8150402
  0.200   3  4.008699  0.8175385
  0.200   4  3.980353  0.8200456
  0.200   5  3.926935  0.8247752
  0.200   6  3.882278  0.8282661
  0.200   7  3.849855  0.8307015
  0.200   8  3.835873  0.8317017
  0.200   9  3.834053  0.8316854
  0.200  10  3.842898  0.8308844
  0.200  11  3.855413  0.8297713
  0.200  12  3.867875  0.8287257
  0.200  13  3.876857  0.8279960
  0.200  14  3.883568  0.8274430
  0.200  15  3.888463  0.8270008
  0.250   1  4.070606  0.8128804
  0.250   2  4.005490  0.8179382
  0.250   3  3.963227  0.8218315
  0.250   4  3.886139  0.8282597
  0.250   5  3.830741  0.8323303
  0.250   6  3.806068  0.8339396
  0.250   7  3.797364  0.8340582
  0.250   8  3.803819  0.8332427
  0.250   9  3.809590  0.8327111
  0.250  10  3.811915  0.8325831
  0.250  11  3.818065  0.8322439
  0.250  12  3.830207  0.8314948
  0.250  13  3.841450  0.8307685
  0.250  14  3.852209  0.8300835
  0.250  15  3.861949  0.8295233

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were sigma = 0.25 and C = 7.

In [72]:
set.seed(13)
predictions <- predict(fit.svm, newdata=validation, na.action=na.pass)
caret:::RMSE(predictions, validation$medv)
MLmetrics::RMSE(predictions, validation$medv)


8.52138738047781
8.52138738047781

The validation RMSE balloons to 8.52, far worse than the 2.85 achieved earlier. The likely cause: the model was trained on the spatial-sign-transformed data, but the raw, untransformed validation set was passed to predict(). The centerScale preprocessing object needs to be applied to the validation set first.
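A minimal sketch of the fix, assuming validation still contains the original columns (not run here):

In [ ]:
# apply the same preProcess object to the validation set before predicting
ssValidation <- predict(centerScale, newdata = validation)
predictions <- predict(fit.svm, newdata = ssValidation, na.action = na.pass)
MLmetrics::RMSE(predictions, validation$medv)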

In [ ]: